Submission to metric track

Introduction

Deciding between man and zone coverage is one of the most critical strategic choices a defensive coordinator must make before each offensive play in American football. While experienced offensive coordinators and quarterbacks often rely on visual cues to identify these defensive schemes, the increasing availability of player tracking data offers a new avenue to uncover and analyze these tactics. A notable example is Amazon’s NFL Next Gen Stats model, which delivers coverage predictions during live broadcasts (see a snapshot of the 2024 Week 12 matchup between the Pittsburgh Steelers and Cleveland Browns). However, these models seem to be trained on plays without pre-snap motion, or at least restricted to the situation before any motion takes place (see Amazon), even though pre-snap motion is a crucial element of modern offensive strategies.

Our project takes this model a step further. While we similarly predict man or zone coverage once both teams are set, we additionally leverage the information contained in pre-snap player movements. Using a hidden Markov model (HMM), we model defenders’ trajectories based on hidden states, which represent the offensive players they may be guarding. Incorporating summary statistics of the probabilistic HMM results as covariates into the existing pre-motion model significantly improves both the AUC and the detection accuracy. It further allows us to evaluate how effective pre-snap motion is in uncovering defensive strategies, providing real-time tactical insights for coaches.

Coverage Prediction

Data

We analyze tracking data from nine weeks of the NFL 2022 season, provided by the NFL Big Data Bowl 2024. Besides the tracking data, we also use information on plays and players. We further use the corresponding data from PFF, which assign each play a category representing the coverage scheme. As one of these categories is not properly described, we omit every play associated with that value. Moreover, we omit plays with more than five offensive linemen, plays with two quarterbacks, and plays that did not contain any pre-snap motion. We end up with XY offensive plays in total, of which the defense played Y in zone and X in man coverage.

Feature engineering

To accurately forecast the defensive scheme (man or zone coverage) for every play, we need to create various features derived from the tracking data. In particular, we conducted the following feature engineering steps. We first consider all 11 players on each side of the field and compute features related to the convex hull of their positions: for defense and offense separately, we compute the area of the convex hull spanned by all players, as well as the largest \(y\) distance (i.e. the width of the hull) and the largest \(x\) distance (i.e. the length of the hull). In addition, we select the five most relevant players on each side of the field. For offense, we omit the offensive line and the QB. For defense, we omit nose tackles, defensive tackles and defensive ends, and select the five defenders closest to the five offensive players according to a weighted Euclidean distance that puts much more emphasis on the \(y\)-axis. Finally, we use their standardized \(x\) and \(y\) coordinates as covariates and order defensive and offensive players according to their \(y\) coordinates. (Distances, orientation? Appendix? Rouven: I would say yes.)
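The convex hull features described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the project; the function names `convex_hull` and `hull_features` and the input format (a list of `(x, y)` tuples for one team in one frame) are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hull_features(points: List[Point]) -> Dict[str, float]:
    """Area, width (y-range) and length (x-range) of the hull of one team."""
    hull = convex_hull(points)
    # Shoelace formula for the area of a simple polygon.
    twice_area = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {
        "area": abs(twice_area) / 2.0,
        "width": max(ys) - min(ys),   # largest y distance
        "length": max(xs) - min(xs),  # largest x distance
    }


# Toy example: four players on the corners of a 4 x 3 rectangle, one inside.
toy_team = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
toy_features = hull_features(toy_team)
```

In practice, the function would be applied to the offensive and the defensive players separately in the frame in which both teams are set.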

Analysis

Our analysis comprises different steps:

1. Pre-motion prediction

We train different models (LASSO, Random Forest, XGBoost) to predict whether the defense plays a man or zone coverage scheme. In particular, …

The model uses the previously described features, …

ROBERT

2. Hidden Markov model

We describe the movements of the five defensive players during the phase of pre-snap motion with a hidden Markov model (see the Appendix for a thorough description). In particular, we assume each defender’s \(y\)-coordinate at each time step \(t\) to be a realization of a Gaussian distribution whose mean is the \(y\)-coordinate of the offensive player being guarded (the guarded player is the underlying state) and whose standard deviation is estimated (see Franks et al. 2015 for a similar approach in basketball).
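To make the model structure concrete, the following sketch computes filtered state probabilities for a single defender with the forward recursion. It is an illustration under stated assumptions rather than the fitted model: the function names, the fixed standard deviation `sd`, the transition matrix `gamma` and the initial distribution `delta` are hypothetical values chosen for the example, whereas in the analysis these parameters are estimated.

```python
import math
from typing import List


def gaussian_pdf(y: float, mean: float, sd: float) -> float:
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def forward_filter(
    defender_y: List[float],       # y-coordinate of one defender per time step
    offense_y: List[List[float]],  # offense_y[t][j]: y-coordinate of offensive player j
    gamma: List[List[float]],      # transition probability matrix (rows sum to 1)
    delta: List[float],            # initial state distribution
    sd: float,                     # emission standard deviation (assumed known here)
) -> List[List[float]]:
    """Return Pr(g_t = j | y_1, ..., y_t) for every time step t and state j."""
    n_states = len(delta)
    filtered: List[List[float]] = []
    prev = list(delta)
    for t, y in enumerate(defender_y):
        # One-step-ahead prediction of the hidden state.
        if t == 0:
            pred = list(delta)
        else:
            pred = [
                sum(prev[i] * gamma[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        # State-dependent densities: the mean is the guarded player's y-coordinate.
        dens = [gaussian_pdf(y, offense_y[t][j], sd) for j in range(n_states)]
        unnorm = [pred[j] * dens[j] for j in range(n_states)]
        total = sum(unnorm)
        # If all densities underflow to zero, fall back to the prediction.
        prev = [u / total for u in unnorm] if total > 0 else pred
        filtered.append(prev)
    return filtered


# Toy example: offensive player 0 moves from y = 10 to y = 18, player 1 stays at y = 30,
# and the defender follows player 0 with an offset of one yard.
toy_offense = [[10.0, 30.0], [14.0, 30.0], [18.0, 30.0]]
toy_defender = [11.0, 15.0, 19.0]
toy_gamma = [[0.9, 0.1], [0.1, 0.9]]
toy_probs = forward_filter(toy_defender, toy_offense, toy_gamma, [0.5, 0.5], sd=2.0)
```

In the toy example, nearly all probability mass ends up on offensive player 0, which is the pattern one would expect under man coverage. Smoothed probabilities or a decoded state sequence could be used instead of the filtered probabilities shown here.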

OLE

The results of the HMM are exemplified using the following video. It displays a touchdown from the Kansas City Chiefs against the Arizona Cardinals in Week 1 of the 2022 NFL season. We can see that, pre-snap, Mecole Hardman (KC #17) is in motion. He is immediately followed by the defender Marco Wilson (AZ #20), which is a clear indication of man coverage.

This probabilistic approach allows us to infer dynamic coverage patterns during plays with pre-snap motion. To exploit this high-dimensional time-series data, we calculate the entropy of the state probabilities, thus producing a measure of uncertainty for each play.
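As an illustration of the entropy summary, the sketch below computes the Shannon entropy of a vector of state probabilities and averages it over time. The function names and the choice of averaging over time steps are assumptions for this example; other aggregations (e.g. per defender, or the maximum over time) are equally conceivable.

```python
import math
from typing import List


def state_entropy(probs: List[float]) -> float:
    """Shannon entropy (natural log) of one state probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(prob_path: List[List[float]]) -> float:
    """Average entropy over all time steps of one defender's state probabilities."""
    return sum(state_entropy(p) for p in prob_path) / len(prob_path)


# A defender who clearly follows one player has low entropy,
# a defender whose assignment is ambiguous has high entropy.
certain_path = [[1.0, 0.0], [1.0, 0.0]]
ambiguous_path = [[0.5, 0.5], [0.5, 0.5]]
low = mean_entropy(certain_path)
high = mean_entropy(ambiguous_path)
```

Low entropy means the probability mass is concentrated on a single offensive player, while the maximum value \(\log N\) is attained when all \(N\) offensive players are equally likely.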

3. Post-motion prediction

We re-train the pre-motion model to predict whether the defense plays a man or zone coverage scheme. In this step, however, we incorporate results from the HMM analysis as further covariates. In particular, we use the aforementioned state probabilities of guarding the different offensive players in each second.

To summarize these probabilities, we derive the decision between man and zone coverage from the number of switches for individual players, i.e. how often the offensive player guarded by a defender changes. In particular, a low number of switches while offensive players are in motion indicates man coverage, whereas a higher number indicates zone coverage.
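Counting switches can be sketched as follows, assuming that a decoded state sequence (one guarded offensive player per time step) is already available for each defender. The function name `count_switches` and the integer encoding of the offensive players are hypothetical.

```python
from typing import List


def count_switches(states: List[int]) -> int:
    """Number of time steps at which the decoded guarded player changes."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# A defender who stays with offensive player 2 throughout the motion ...
man_like = [2, 2, 2, 2, 2, 2]
# ... versus a defender who passes players on while they cross his area.
zone_like = [2, 2, 1, 1, 0, 1]

man_switches = count_switches(man_like)
zone_switches = count_switches(zone_like)
```

The resulting counts per defender can then be used as additional covariates in the post-motion model.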

Entropy

Results

By comparing the predictive performance of our model without motion and our post-motion model, we can determine how effective player movements before the snap are in revealing the correct defensive scheme. Moreover, we assess which teams predominantly use pre-snap motion to increase the likelihood of correctly identifying the defensive strategy.
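For such a comparison, the AUC can be computed directly from predicted probabilities and true labels. The sketch below uses the pairwise (Mann-Whitney) formulation; the function name `auc`, the coding of man coverage as 1 and zone coverage as 0, and the toy scores are assumptions for illustration and are not results of our analysis.

```python
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random positive play gets a higher score than a random negative one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one play of each class.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data with fewer man (1) than zone (0) plays.
toy_labels = [1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
toy_auc = auc(toy_labels, toy_scores)
```

Because the AUC depends only on the ranking of the scores within and across the two classes, it is less affected by class imbalance than plain accuracy.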

Use AUC because the data are unbalanced. Insert the animation here, showing the connections given by the decoded states.

Check robustness using simple summary statistics after the motion.

Tests for conditional independence (perhaps in the Appendix).

Team analyses

Discussion

One limitation of our approach lies in the imperfect prediction accuracy of the pre-motion model, primarily due to insufficient hyperparameter tuning and the relatively small number of plays involving motion. However, the primary focus of this project was on pre-snap motion, particularly on how to effectively translate this information into the hidden Markov model. Importantly, our pre-motion model is modular and can be seamlessly replaced by another model, such as the NFL Next Gen Stats model, which can then be integrated into our modeling extension to fully leverage the insights provided by pre-snap motion.

Code

All code for data pre-processing, model training, prediction and player evaluation can be found here.

References

* Franks A, Miller A, Bornn L, Goldsberry K (2015). Characterizing the Spatial Structure of Defensive Skill in Professional Basketball. The Annals of Applied Statistics, 9(1). DOI: 10.1214/14-AOAS799.

* Zucchini W, MacDonald I, Langrock R (2016). Hidden Markov Models for Time Series: An Introduction Using R. CRC Press.

Appendix

Hidden Markov Model

A hidden Markov model consists of an observed time series \(\{\boldsymbol{y}_t\}_{t=1}^T\) (here, the \(y\)-coordinates of the defensive players) and an unobserved first-order Markov chain \(\{ g_t\}_{t=1}^T\), with \(g_t \in \{1,\ldots,N\}\), which proxies the offensive player to be guarded at every time point \(t\). The Markov chain is fully described by an initial distribution \(\boldsymbol{\delta}=\bigl( \Pr(g_1=1), \ldots, \Pr(g_1=N) \bigr)\) and a transition probability matrix (t.p.m.) \(\boldsymbol{\Gamma} = (\gamma_{ij})\), with \(\gamma_{ij} = \Pr(g_t = j \mid g_{t-1} = i), \ i,j = 1, \ldots, N\). The connection between the two stochastic processes arises from the assumption that the distribution of the observation \(\boldsymbol{y}_t\) is fully determined by the currently active state, i.e. \[ f(\boldsymbol{y}_t \mid g_1, \ldots, g_T, \boldsymbol{y}_1, \ldots, \boldsymbol{y}_{t-1},\boldsymbol{y}_{t+1},\ldots,\boldsymbol{y}_T) = f(\boldsymbol{y}_t \mid g_t). \] In general, \(f\) can be any density or probability mass function, depending on the type of data. Following the approach of Franks et al. (2015), we opt for a Gaussian distribution.
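As a worked illustration of these definitions, consider a single defender with scalar observation \(y_t\). The symbols \(\mu_{j,t}\) (the \(y\)-coordinate of offensive player \(j\) at time \(t\)), \(\sigma\) (the standard deviation to be estimated) and \(\boldsymbol{P}(y_t)\) are notation introduced here for the example. Under the Gaussian assumption, the state-dependent density reads \[ f(y_t \mid g_t = j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_t - \mu_{j,t})^2}{2\sigma^2} \right), \quad j = 1, \ldots, N. \] Collecting these densities in the diagonal matrix \(\boldsymbol{P}(y_t) = \operatorname{diag}\bigl( f(y_t \mid g_t = 1), \ldots, f(y_t \mid g_t = N) \bigr)\), the likelihood of the observed sequence can be written in the standard matrix-product form (see Zucchini et al. 2016) \[ \mathcal{L} = \boldsymbol{\delta}\, \boldsymbol{P}(y_1)\, \boldsymbol{\Gamma}\, \boldsymbol{P}(y_2) \cdots \boldsymbol{\Gamma}\, \boldsymbol{P}(y_T)\, \boldsymbol{1}^\top, \] where \(\boldsymbol{1}\) is a row vector of ones of length \(N\). Evaluating this product from left to right corresponds to the forward algorithm, which requires \(O(TN^2)\) operations and makes numerical maximization of the likelihood feasible.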